Goal and Video

The goal of this lab is to become familiar with the grammar of graphics philosophy implemented in the R package ggplot2. It is a more modern alternative to base R plotting, but also more powerful and entirely beginner friendly, so we will skip base R plots and get you accustomed right away to this particular way of making plots. To do for Wednesday:

  • Begin by watching the first part of the lecture on data visualization (0 to 56 min -- we will watch the rest for Thursday): this will provide you with an introduction to the importance of plotting and give you an overview of the different plotting functions in ggplot2. We will work through this lab during our discussion sessions on Wednesday and Thursday. You should run the different chunks of code, observe the structure, and try tweaking the input to see how ggplot2 works.

  • Complement the lecture by reading this chapter on the ggplot grammar

  • If you are very unfamiliar with plotting commands, you can also read the following introductory chapter on data visualization and watch the following series of very short YouTube videos. Most of them use base R code, but the arguments and the logic behind the plots are similar in ggplot2, so they are a good way of introducing the types of plots that you'll need:

    1. Histograms
    2. QQ plots
    3. Boxplots
    4. Scatter plots
    5. Plots to avoid
    6. On qplot -- a wrapper for very simple plots with ggplot2
  • Work through the first part of this lab by running all the R code on your computer and making sure that you understand the input and the output.

To do for Thursday:

  • Finish the video lecture

  • Go through the rest of the lab (the appropriate sections).

Getting Started

We will be going through some of the topics from the ggplot2 documentation website, docs.ggplot2.org. You should keep that website open as a reference throughout this tutorial.

Another useful resource, the complete guide to the Grammar of Graphics philosophy, is available through the Springer website (to which you have free, legal access with your Stanford ID).

Required Packages

In order to learn how to use ggplot2, we will use a few datasets available through the nutshell, parathyroidSE, and EnsDb.Hsapiens.v86 packages.

The code below checks whether all packages required for today's lab are available, and installs any missing ones. Note that (in the template file) the chunk declaration includes the following options: message = FALSE, warning = FALSE, results = "hide". These R Markdown settings suppress messages and warnings that would otherwise appear below the chunk. Another way to avoid lengthy messages during package loading is to wrap R commands in the suppressMessages() function, e.g. suppressMessages(library("foo")).

pkgs_needed <- c("tidyverse", "Biostrings", 
                 "parathyroidSE", "EnsDb.Hsapiens.v86", "nutshell")
letsinstall = setdiff(pkgs_needed, installed.packages()) 
if (length(letsinstall) > 0) {
  BiocManager::install(letsinstall)
}
library("dplyr")
library("ggplot2")
library("Biostrings")

Load Data

First, we will use a data set about births in the USA in 2006. You can read more about it in the "R in a Nutshell, 2nd edition" book, which is freely available as a PDF file online. This is convenient since you can follow the same analyses in that book (except that we translate the plotting into the grammar of graphics). You can download the dataset here from the course's website.

First, we load the dataset and add two new variables: one to encode whether the day falls on a weekend or whether it is a weekday, and a three-level, discretized health score based on the APGAR 5 score.

We also fix an idiosyncratic encoding of missing values as 99, replacing them with proper NAs.

Finally, we define a subsampled version of the dataset, births_small, to speed up some of the plotting (feel free to try the computations and plots we see subsequently with the full dataset births).
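As a quick sketch of the discretization step (base R, with made-up APGAR values rather than the real data): findInterval() returns, for each value, how many of the cut points it exceeds, and adding 1 turns that into an index into a vector of labels.

```r
# findInterval() returns 0 for values below 3.5, 1 for values in
# [3.5, 6.5), and 2 for values >= 6.5; NA stays NA
apgar  <- c(2, 5, 9, NA)                # made-up example scores
labels <- c("CritLow", "Low", "Normal")
health <- labels[1 + findInterval(apgar, c(3.5, 6.5))]
health  # "CritLow" "Low" "Normal" NA
```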

load("~/Downloads/births2006.smpl.rda")

births <- mutate(births2006.smpl,
  WEEKEND = ifelse(DOB_WK %in% c(1, 7), "Weekend", "Weekday"),
  HEALTH  = c("CritLow", "Low", "Normal")[ 1 +
    findInterval(APGAR5, c(3.5, 6.5)) ],
  ESTGEST = replace(ESTGEST, ESTGEST==99, NA))

set.seed(1)
births_small <- births[ sample(nrow(births), 40000), ]
head(births_small)
##        DOB_MM DOB_WK MAGER TBO_REC WTGAIN SEX APGAR5                 DMEDUC
## 24388       2      2    24       3     44   F      9 2 years of high school
## 124413      6      5    21       2     39   M      5                   NULL
## 331730      3      3    29       2     29   F      9     2 years of college
## 142643      9      5    35       4     NA   F     NA                   NULL
## 25173       6      4    35       4     NA   F     NA                   NULL
## 294762      7      4    33       4     60   M      9 2 years of high school
##        UPREVIS ESTGEST DMETH_REC  DPLURAL DBWT WEEKEND HEALTH
## 24388       10      37   Vaginal 1 Single 2778 Weekday Normal
## 124413       8      39   Vaginal 1 Single 3160 Weekday    Low
## 331730      16      37 C-section 1 Single 3040 Weekday Normal
## 142643      13      NA C-section 1 Single 3317 Weekday   <NA>
## 25173       10      NA   Vaginal 1 Single 2778 Weekday   <NA>
## 294762      15      40   Vaginal 1 Single 3969 Weekday Normal

Plot initialization & geom's (Wednesday)

ggplot2 is a plotting system for R, based on the grammar of graphics. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics.

You can think of plots generated with ggplot2 as sentences. The function ggplot() starts a sentence (initializes a plot) which usually contains a noun/subject (a dataset with information to be plotted).

The + operator is used to add further clauses/fragments containing "verbs", "adjectives", "adverbs" etc. to the sentence. This allows you to construct sophisticated plots from a few elementary building blocks, just like forming compound sentences from simple phrases.

In ggplot world, verbs are geometric objects, or geom's. For example, geom_bar() draws bars, geom_point() draws points. There can be multiple geoms in a sentence, and these are then also called layers.

For each geom, you need to define required aesthetics using the function aes(). The aesthetics of a geom determine how the attributes of the geometric object (e.g. the x- and y- coordinates of the points, their colors and shapes, etc.) are mapped to the columns of the supplied data.frame. The aesthetics mapping can be defined when initializing a plot (with ggplot(data = dataframe_name, aes(...))), which makes it apply to all geom's. Otherwise, one can specify aes() mapping for each geom separately (e.g. geom_bar(aes(...))), in which case it applies to that particular geom only.

It is also possible to use a different dataset for geom than the one supplied to ggplot(). To do this you simply supply a new data.frame as a data argument of geom (e.g. geom_point(data = new_dataframe_name, aes(...))) which would overwrite the dataset used to this geom.
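A minimal sketch of these two scoping rules, using the built-in mtcars dataset rather than the lab data (the highlight subset is just illustrative):

```r
library(ggplot2)

highlight <- subset(mtcars, cyl == 8)          # illustrative subset

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +    # plot-level data and aes
  geom_point() +                               # inherits both from ggplot()
  geom_point(data = highlight, colour = "red") # this geom uses its own data
p  # printing the object draws both layers
```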

For example, let's look at the number of births on each day of the week in our sample data. First set up our plot with the data births.

ppp <- ggplot(births_small) 

Note that this doesn't actually plot anything (just like junk <- 7 doesn't output anything until you run junk).

Question : What happens here when you run ppp?

The idea is that you then need to add the objects that you want to plot to ppp -- a little like legos.

Let's start off easy by plotting a scatter plot, to understand the relation between weight gain and estimated gestational time. To create a geometric object layer for a scatter plot we use the function geom_point(). In order for it to know what part of the data we want to actually plot, we need to give an aesthetic. We can do this by declaring that **the aesthetic for the x-axis is the estimated gestation time and the aesthetic for the y-axis is the weight gain**.

print(ppp + geom_point(aes(x=ESTGEST, y=WTGAIN)))
## Warning: Removed 7194 rows containing missing values (geom_point).

The number of births per day of the week can be shown easily in a barplot, so let's use that. To create a geometric object layer for a barplot we use the function geom_bar(). In order for it to know what part of the data we want to actually plot, we need to give an aesthetic. We can do this by declaring that the aesthetic for the x-axis is the day of the week of the birth.

The column DOB_WK gives the day of the week that each birth happened as a numeric value 1 (Sunday) through 7 (Saturday). We can tell R that this is actually a factor variable by putting the variable name in the function factor. Putting this all together, we get geom_bar(aes(x = factor(DOB_WK))). Finally, to add this layer to the initialized graph, we add the geom to ppp with the + operator.

ppp <- ggplot(births_small) 
ppp + geom_bar(aes(x = factor(DOB_WK)))

Doctors are able to delay birth with anti-contraction medications or labor suppressants called tocolytics. That might be one reason we see fewer births on day 1 of the week (Sunday) and day 7 (Saturday).

We can get further information from this plot if we add more aesthetics. For example, maybe we can fill each bar with different colors corresponding to what type of birth it was ("C-section", "Vaginal", or "Unknown"). We can do this by just including another aes in the geometric object. Start with the same initialization, but add a geometric object with the x-axis and also fill color defined.

ppp <- ggplot(births_small) 
ppp + geom_bar(aes(x = factor(DOB_WK), fill = DMETH_REC))

When we made that plot, we used the default value for the position argument of the geom_bar function. We could have equivalently written the code as follows.

ppp <- ggplot(births_small) 
ppp + geom_bar(aes(x = factor(DOB_WK), fill = DMETH_REC), position = "stack")

Another option is to use position = "dodge". Note that this is an argument to geom_bar and not to aes.

ppp <- ggplot(births_small) 
ppp + geom_bar(aes(x = factor(DOB_WK), fill = DMETH_REC), position = "dodge")

Now we see that about 1/2 as many C-sections were on weekends as there were on a typical weekday, whereas there were about 3/4 as many vaginal births on weekends as there were on weekdays.

Facets (Wednesday)

Let's continue looking at the birth data on a day to day basis. We might conjecture that older women are less likely to take a tocolytic since they are more likely to have had other complications already. One way we can do this is to "facet" the graph and display day of the week versus women's age.

First, let's make a histogram of the women's ages to get an idea of what the distribution looks like.

ppp <- ggplot(births_small) 
ppp + geom_histogram(aes(x = MAGER), binwidth = 1)

We used the argument binwidth to set the width of the bins in this histogram (here binwidth = 1 corresponds to 1 year).

Using the grammar of graphics, it is easy to facet this single graph to make multiple graphs along another dimension of the data. In this case, we're interested in breaking this up along the dimension of day of birth DOB_WK. We will add this facet with the command facet_grid or facet_wrap. In facet_grid, the argument we use is a formula with the rows (of the tabular display) on the left hand side and the columns (of the tabular display) on the right hand side (RHS).

A formula is created in R with the tilde operator ~. A dot in the formula is used to indicate there should be no faceting on this dimension (either row or column). The formula can also be provided as a string instead of a classical formula object. In facet_wrap, we have the same sort of argument, but we only include a RHS of the formula. We'll use both of them in an example so you can see the difference.
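The different formula forms can be sketched as follows, again with the built-in mtcars data (any discrete columns would do):

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = mpg)) + geom_histogram(binwidth = 2)

p + facet_grid(cyl ~ .)     # rows by cyl, no faceting on columns
p + facet_grid(. ~ gear)    # columns by gear, no faceting on rows
p + facet_grid(cyl ~ gear)  # rows by cyl AND columns by gear
p + facet_wrap(~ cyl)       # one-sided formula; panels wrapped into a grid
```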

Now let's look at this faceting on that variable. Again, we will use the + operator. Here, we also see that we can save the geometric objects in the plot and just add facets at the end.

ppph <- ggplot(births_small) + 
   geom_histogram(aes(x = MAGER, fill = DMETH_REC), binwidth = 1)
ppph + facet_wrap( ~ DOB_WK)

ppph + facet_grid(DOB_WK ~ SEX)

What is the difference between facet_wrap and facet_grid?

Here is an interesting perspective of the data (we use dplyr::filter to exclude the record where the delivery method is Unknown).

ggplot(dplyr::filter(births, !(DMETH_REC %in% "Unknown"))) +
  geom_histogram(aes(x = MAGER, fill = factor(TBO_REC)), binwidth = 1) +
  facet_grid(WEEKEND ~ DMETH_REC, scale="free_y", drop = TRUE) +
  geom_vline(xintercept = seq(15, 45, by=5), alpha=0.2, color="white") +
  labs(title = "Births in USA 2006", fill="Birth\nOrder")

Statistics (Wednesday)

It's often useful to transform your data before plotting, and that's what statistical transformations do.

Here we look at the histogram of mother's ages as before, but we also add a density estimate to it.

ggplot(births_small, aes(x = MAGER)) + 
  geom_histogram(aes(y = ..density..), binwidth = 1, fill = "grey", col = "black") +
  stat_density(col = "red", fill = NA)

Maybe we want to compare this to a lognormal distribution's density.

ggplot(births_small, aes(x = MAGER)) + 
  geom_histogram(aes(y = ..density..), binwidth = 1, fill="grey", col="black") +
  stat_density(col="red", fill=NA) +
  stat_function(fun = dlnorm, col = "blue", 
                args = list(mean = mean(log(births_small$MAGER)),
                             sd =    sd(log(births_small$MAGER))))

You can add any function using the stat_function.

In this example, we will plot the birth weight in grams (DBWT) versus weight gain (WTGAIN) by mother (which seems to be in pounds). Unfortunately, we need to be a little more careful about this when there are NAs in the data.

ppp2 <- ggplot(dplyr::filter(births_small, !is.na(WTGAIN) & !is.na(DBWT)), 
               aes(x = WTGAIN, y = DBWT)) +
  labs(x = "Weight Gain by Mother", y = "Birth Weight in Grams")
ppp2 + geom_point()

When we plot this data by itself, it is not clear what is going on -- there are many points plotted on top of each other and it just looks messy.

One way to transform this data in ggplot2 is to bin the points with stat_bin2d(), as below.

ppp2 + stat_bin2d()

See that the color axis reports counts, i.e. the total number of observations that fall within each box; we are effectively generating a 2D density plot. This plot seems to indicate that the joint distribution of weight gain by mother and birth weight in grams is unimodal.
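What stat_bin2d() computes can be sketched in base R by cutting each axis into intervals and cross-tabulating (mtcars again as stand-in data):

```r
# cut() discretizes each axis into intervals; table() then counts the
# observations that fall into each rectangular cell
xbin <- cut(mtcars$wt,  breaks = 5)
ybin <- cut(mtcars$mpg, breaks = 5)
counts2d <- table(xbin, ybin)
sum(counts2d) == nrow(mtcars)  # TRUE: every point lands in exactly one cell
```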

Instead of just counting, though, we can compute other quantities (evaluate functions other than count) within each of the 2D bins, based on a third column of births_small. For example, let's look at the median number of weeks of gestation for each of the bins.

ppp2 + stat_summary_2d(aes(z = ESTGEST), fun = median) + 
  labs(title = "median number of weeks of gestation")
## Warning: Removed 55 rows containing non-finite values (stat_summary2d).

If we want to do a regression, we can include this automatically. By default, ggplot2 uses "locally weighted scatterplot smoothing" (loess) regression for data sets with < 1000 points and "generalized additive models" (GAM) for >= 1000 points.

# here we force a linear model fit
ppp2 + geom_point() + stat_smooth(method = lm) 
## `geom_smooth()` using formula 'y ~ x'

By changing the color aesthetics of the modelled line, we can fit separate models to the corresponding subsets of the data and compare them. For example, let's look at a smoothed fit (GAM) of birthweight versus weight gain by mother separated out by baby boys and girls.

ppp2 + geom_point() + stat_smooth(aes(col = SEX))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Finally, let's look at hexagonal bins as a visually more attractive alternative to the rectangular bins used so far.

ppp2 + geom_hex(bins = 30) + stat_smooth(aes(col = SEX))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Scales (Wednesday)

Scales control the mapping between data and aesthetics. Going back to the 2d distribution of birthweight versus weight gain by mothers, it is difficult to see what is going on except that there is a dense region and a less dense region. If we take the square root of the number of births per region, then we can see that there is a smooth transition between the high density area and the low density area.

# we repeat the code from above so you don't have to scroll to look at it.
ppp2 <- ggplot(dplyr::filter(births_small, !is.na(WTGAIN) & !is.na(DBWT)), 
               aes(x = WTGAIN, y = DBWT)) +
  labs(x = "Weight Gain by Mother", y = "Birth Weight in Grams")

ppp2 + stat_bin2d() + scale_fill_gradient(trans = "sqrt")

See what happens when you change the trans to log10.

Sometimes, we might want to change the scale of the vertical axis. According to Wikipedia's page on "Fetal Viability", there is a 50% chance of viability at 24 weeks of gestation. We will repeat the plot above with birth weight on a log scale, so we get better separation of the mean weeks of gestation.

ppp2 + stat_summary_2d(aes(z = ESTGEST), fun = mean) + 
  scale_y_log10(limits = 10^c(2, 4)) +
  scale_fill_gradient2(midpoint = 24) +
  labs(title="mean number of weeks of gestation")
## Warning: Removed 55 rows containing non-finite values (stat_summary2d).

Now let's look at only the quadruplets in the full dataset (there are 39 such observations). We want to include lots of variables such as number of prenatal visits (UPREVIS), the mother's age (MAGER), the estimated weeks of gestation (ESTGEST), the delivery method (DMETH_REC), and the mother's education level (DMEDUC).

ppp3 <- ggplot(dplyr::filter(births, DPLURAL == "4 Quadruplet"), 
               aes(x = UPREVIS, y = MAGER)) + 
  geom_point(aes(size = ESTGEST, shape = DMETH_REC, col = DMEDUC)) +
  stat_smooth(aes(col = DMETH_REC), method = "lm")
ppp3
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 4 rows containing missing values (geom_point).

ppp3 + scale_size(range=c(3, 6)) + scale_color_brewer(palette = "Set1") +
  scale_shape(solid = FALSE)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 4 rows containing missing values (geom_point).

Note that, while shapes are discrete, colors and sizes can both be scaled continuously.

Quantile-quantile plot (Wednesday)

Remember that in the power calculations section of the last lab, we defined a statistic estimating the deviation of the observed from the expected frequencies of nucleotides. The code below generates the statistic under the null distribution (i.e. when all 4 nucleotides are equally likely).

oestat = function(o, e){
  sum( (e-o)^2/e )
}

set.seed(1)
B = 10000
# here we pick an arbitrary length / not the same as for Celegans
n = 2847
expected = rep(n/4 ,4)
oenull = replicate(
  B, oestat(e=expected, o=rmultinom(1,size = n, prob = rep(1/4,4))))

Now, we can estimate the null distribution of this statistic by plotting a histogram of the generated values:

ggplot(data.frame(null_stats = oenull)) +
  geom_histogram(aes(x = null_stats), bins = 100, boundary=0)

We compare the distribution of this statistic to \(\chi^2_3\) (df = 1 * (4-1) = 3) using a Q-Q plot:

 ggplot(data.frame(stat = oenull), aes(sample = stat)) +
   stat_qq(distribution = stats::qchisq, dparams = list(df = 3)) +
   stat_qq_line(distribution = stats::qchisq, dparams = list(df = 3)) 

Parathyroid Example (Wednesday)

We will use the parathyroidSE package in R. Load the parathyroidGenesSE data and read the experimental information and the abstract.

library("parathyroidSE")
library("EnsDb.Hsapiens.v86")

data("parathyroidGenesSE", package = "parathyroidSE")
metadata(parathyroidGenesSE)$MIAME 
## Experiment data
##   Experimenter name: Felix Haglund 
##   Laboratory: Science for Life Laboratory Stockholm 
##   Contact information: Mikael Huss 
##   Title: DPN and Tamoxifen treatments of parathyroid adenoma cells 
##   URL: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37211 
##   PMIDs: 23024189 
## 
##   Abstract: A 251 word abstract is available. Use 'abstract' method.
abstract(metadata(parathyroidGenesSE)$MIAME)
## [1] "Primary hyperparathyroidism (PHPT) is most frequently present in postmenopausal women. Although the involvement of estrogen has been suggested, current literature indicates that parathyroid tumors are estrogen receptor (ER) alpha negative. Objective: The aim of the study was to evaluate the expression of ERs and their putative function in parathyroid tumors. Design: A panel of 37 parathyroid tumors was analyzed for expression and promoter methylation of the ESR1 and ESR2 genes as well as expression of the ERalpha and ERbeta1/ERbeta2 proteins. Transcriptome changes in primary cultures of parathyroid adenoma cells after treatment with the selective ERbeta1 agonist diarylpropionitrile (DPN) and 4-hydroxytamoxifen were identified using next-generation RNA sequencing. Results: Immunohistochemistry revealed very low expression of ERalpha, whereas all informative tumors expressed ERbeta1 (n = 35) and ERbeta2 (n = 34). Decreased nuclear staining intensity and mosaic pattern of positive and negative nuclei of ERbeta1 were significantly associated with larger tumor size. Tumor ESR2 levels were significantly higher in female vs. male cases. In cultured cells, significantly increased numbers of genes with modified expression were detected after 48 h, compared to 24-h treatments with DPN or 4-hydroxytamoxifen, including the parathyroid-related genes CASR, VDR, JUN, CALR, and ORAI2. Bioinformatic analysis of transcriptome changes after DPN treatment revealed significant enrichment in gene sets coupled to ER activation, and a highly significant similarity to tumor cells undergoing apoptosis. Conclusions: Parathyroid tumors express ERbeta1 and ERbeta2. Transcriptional changes after ERbeta1 activation and correlation to clinical features point to a role of estrogen signaling in parathyroid function and disease."

Parathyroid adenoma (http://en.wikipedia.org/wiki/Parathyroid_adenoma) is a benign tumor of the parathyroid gland. The abstract tells us that some interesting genes to look at are the following:

  • Estrogen related genes: ESR1, ESR2.
  • Parathyroid related genes: CASR, VDR, JUN, CALR, ORAI2.

Let's put them in a table.

genes <- read.csv(textConnection(
  "name, group
   ESR1,  estrogen
   ESR2,  estrogen
   CASR,  parathyroid
   VDR,   parathyroid
   JUN,   parathyroid
   CALR,  parathyroid
   ORAI2, parathyroid"), 
  stringsAsFactors = FALSE, strip.white = TRUE)

In the parathyroidGenesSE object, the features are labeled with Ensembl gene identifiers, so let's use a Bioconductor package to find the corresponding IDs.

ens <- ensembldb::select(EnsDb.Hsapiens.v86,
  keys = list(GenenameFilter(genes$name), 
              TxBiotypeFilter("protein_coding")),
  columns = c("GENEID", "GENENAME"))
## Warning: 'GenenameFilter' is deprecated.
## Use 'GeneNameFilter' instead.
## See help("Deprecated")
## Note: ordering of the results might not match ordering of keys!
ens <- 
  dplyr::filter(ens, GENEID %in% rownames(parathyroidGenesSE)) %>%
  mutate(group = genes$group[match(GENENAME, genes$name)])

ens
##            GENEID GENENAME      TXBIOTYPE       group
## 1 ENSG00000179218     CALR protein_coding parathyroid
## 2 ENSG00000036828     CASR protein_coding parathyroid
## 3 ENSG00000091831     ESR1 protein_coding    estrogen
## 4 ENSG00000140009     ESR2 protein_coding    estrogen
## 5 ENSG00000177606      JUN protein_coding parathyroid
## 6 ENSG00000160991    ORAI2 protein_coding parathyroid
## 7 ENSG00000111424      VDR protein_coding parathyroid

Make the table of gene counts, add the patient info:

countData <- assay( parathyroidGenesSE ) 
gene.counts <- t(countData[ens$GENEID, ])
colnames(gene.counts) <- ens$GENENAME
dat <- cbind(data.frame(colData( parathyroidGenesSE)), data.frame(gene.counts))
head(dat)
##         run experiment patient treatment time submission     study    sample
## 1 SRR479052  SRX140503       1   Control  24h  SRA051611 SRP012167 SRS308865
## 2 SRR479053  SRX140504       1   Control  48h  SRA051611 SRP012167 SRS308866
## 3 SRR479054  SRX140505       1       DPN  24h  SRA051611 SRP012167 SRS308867
## 4 SRR479055  SRX140506       1       DPN  48h  SRA051611 SRP012167 SRS308868
## 5 SRR479056  SRX140507       1       OHT  24h  SRA051611 SRP012167 SRS308869
## 6 SRR479057  SRX140508       1       OHT  48h  SRA051611 SRP012167 SRS308870
##   CALR  CASR ESR1 ESR2  JUN ORAI2  VDR
## 1 5055 14525    2    1 1799   632 1167
## 2 6161 16870    3    3 1153  1424 1653
## 3 3333  7798    1    1 1086   309  688
## 4 6407 14299    2    1 1227   974 1407
## 5 3810  9235    4    1 1258   326  702
## 6 4885 12994    2    3  906   814 1235

Plot the counts of one of the estrogen related genes (ESR1), using plot aesthetics and faceting to separate patients, treatments and times.

ggplot(dat, aes(col = patient, x = treatment, y = ESR1)) +
  geom_point(size = 3) + 
  facet_grid( . ~ time)

Questions (Wednesday)

Answer the following questions.

From the plot of the parathyroid data, answer the following.

Quiz question 1 : How many patients are there?

Quiz question 2 : How many time points are there?

Quiz question 3 : There were 3 treatments: "Control", "DPN", and "OHT". How many measurements were taken from patient 2 under the DPN treatment?

Make your own plot of VDR versus CASR. (That is CASR, not CALR).

Quiz question 4 : Which patient has the highest recorded level of CASR?

Quiz question 5 : Which of the following pairs of patients seem to be well separated in this plot (i.e., for which two patients you can draw a line on the plot that perfectly separates them)?

Quiz question 6 : You need to know which pairs of patients are well separated with respect to the genes VDR and CASR (i.e., you can draw a line on the plot that perfectly separates the patients). Which of the following methods can help you visualize this?

Quiz question 7 : Which patient looks different from the other three when you plot VDR versus ORAI2?

Quiz question 8 : Plot ORAI2 versus JUN. Which can you separate best?

Quiz question 9 : Plot CASR versus (ESR1 + ESR2). Fit a separate linear model for each treatment (Control, DPN, OHT). Which linear models are increasing?

Quiz question 10 : What is the maximum number of shapes that you are allowed to use in ggplot2 by default?

Quiz question 11 : Write the name of the function that you could use to make more than the maximum number of default shapes allowed. Hint: this function has "values" as one of the arguments ____(..., values = (...)).

Quiz question 12 : What do Themes do in ggplot2?

CpG Islands (Thursday)

Let's walk through a last example.

pkgs_needed <- c("BSgenome.Hsapiens.UCSC.hg19", "Gviz")
letsinstall = setdiff(pkgs_needed, installed.packages()) 
if (length(letsinstall) > 0) {
  BiocManager::install(letsinstall)
}

This is additional material, which wasn't covered in class but is included in the textbook. You will benefit from completing this part.

GC content is the percentage of the genome (or DNA fragment) that is "G" or "C". To compute the GC content, we count the occurrences of the letters "G" and "C", and divide by the length of the string in question.

We will be using data from chr8 of the human genome version 19 from the UCSC genome repository.

A genomic window is defined as a CpG island if its GC content is > 50% and its observed-to-expected CpG ratio is > 0.6. CpG islands are said to mark important regions in the genome because over 65% of gene promoter regions are found within CpG islands.
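These two criteria can be sketched with small base-R helpers (the function names gc_content and obs_exp_cpg are hypothetical, and real CpG-island callers work on sliding windows; this is just the core computation):

```r
gc_content <- function(x) {
  b <- strsplit(x, "")[[1]]
  mean(b %in% c("G", "C"))                  # fraction of letters that are G or C
}
obs_exp_cpg <- function(x) {
  b <- strsplit(x, "")[[1]]
  n <- length(b)
  obs <- sum(b[-n] == "C" & b[-1] == "G")   # observed "CG" dinucleotides
  exp <- sum(b == "C") * sum(b == "G") / n  # expected count under independence
  obs / exp
}

s <- "CGCGGCGCTT"  # toy sequence
gc_content(s)      # 0.8
obs_exp_cpg(s)     # 1.875
```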

We want to look at this for the Human Chromosome 8. We can use Bioconductor to get the data with the command BiocManager::install("BSgenome.Hsapiens.UCSC.hg19").

library(BSgenome.Hsapiens.UCSC.hg19)
## Loading required package: BSgenome
## Loading required package: rtracklayer
chr8  =  Hsapiens$chr8
cpg_url <- url("http://web.stanford.edu/class/bios221/data/model-based-cpg-islands-hg19.txt")
CpGtab <- read.table(cpg_url, header=T)
nrow(CpGtab)
## [1] 65699

Then we retain only the start and stop positions for chromosome 8 and convert into an IRanges object:

irCpG = with(dplyr::filter(CpGtab, chr == "chr8"),
         IRanges(start = start, end = end))

We create the biological context with the next line. The “I” in IRanges stands for “interval”; the “G” in GRanges for “genomic”.

grCpG = GRanges(ranges = irCpG, seqnames = "chr8", strand = "+")
genome(grCpG) = "hg19"
library("Gviz") # you might need to install this from Bioconductor
## Loading required package: grid
## 
## Attaching package: 'Gviz'
## The following object is masked from 'package:AnnotationFilter':
## 
##     feature
ideo = IdeogramTrack(genome = "hg19", chromosome = "chr8")
plotTracks(
  list(GenomeAxisTrack(),
    AnnotationTrack(grCpG, name = "CpG"), ideo),
    from = 2200000, to = 5800000,
    shape = "box", fill = "#006400", stacking = "dense")

Views on the chromosome sequence correspond to the CpG islands, irCpG, and to the regions in between (gaps(irCpG)).

CGIview    = Views(unmasked(Hsapiens$chr8), irCpG)
NonCGIview = Views(unmasked(Hsapiens$chr8), gaps(irCpG))

We compute transition counts in CpG islands and non-islands using the data.

seqCGI      = as(CGIview, "DNAStringSet")
seqNonCGI   = as(NonCGIview, "DNAStringSet")
dinucCpG    = sapply(seqCGI, dinucleotideFrequency)
dinucNonCpG = sapply(seqNonCGI, dinucleotideFrequency)
dinucNonCpG[, 1]
##  AA  AC  AG  AT  CA  CC  CG  CT  GA  GC  GG  GT  TA  TC  TG  TT 
## 389 351 400 436 498 560 112 603 359 336 403 336 330 527 519 485
NonICounts = rowSums(dinucNonCpG)
IslCounts  = rowSums(dinucCpG)

For a four-state Markov chain, as we have here, we define the transition matrix as a matrix where the rows are the "from" states and the columns are the "to" states.

TI  = matrix( IslCounts, ncol = 4, byrow = TRUE)
TnI = matrix(NonICounts, ncol = 4, byrow = TRUE)
dimnames(TI) = dimnames(TnI) =
  list(c("A", "C", "G", "T"), c("A", "C", "G", "T"))
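As a toy check of this convention (base R, made-up sequence): count each observed transition into a 4 x 4 matrix, then divide by the row sums so that each "from" row becomes a probability distribution.

```r
nts    <- c("A","C","G","T","A","C","G","C","G","T","A")  # made-up sequence
states <- c("A", "C", "G", "T")
TC <- matrix(0, 4, 4, dimnames = list(from = states, to = states))
for (i in seq_len(length(nts) - 1))
  TC[nts[i], nts[i + 1]] <- TC[nts[i], nts[i + 1]] + 1
M <- TC / rowSums(TC)  # each row ("from" state) now sums to 1
rowSums(M)             # 1 1 1 1
```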

We use the counts of numbers of transitions of each type to compute frequencies and put them into two matrices.

MI = TI /rowSums(TI)
MI
##            A         C         G         T
## A 0.20457773 0.2652333 0.3897678 0.1404212
## C 0.20128250 0.3442381 0.2371595 0.2173200
## G 0.18657245 0.3145299 0.3450223 0.1538754
## T 0.09802105 0.3352314 0.3598984 0.2068492
MN = TnI / rowSums(TnI)
MN
##           A         C          G         T
## A 0.3351380 0.1680007 0.23080886 0.2660524
## C 0.3641054 0.2464366 0.04177094 0.3476871
## G 0.2976696 0.2029017 0.24655406 0.2528746
## T 0.2265813 0.1972407 0.24117528 0.3350027

Are the relative frequencies of the different nucleotides different in CpG islands compared to elsewhere?

freqIsl = alphabetFrequency(seqCGI, baseOnly = TRUE, collapse = TRUE)[1:4]
freqIsl / sum(freqIsl)
##         A         C         G         T 
## 0.1781693 0.3201109 0.3206298 0.1810901
freqNon = alphabetFrequency(seqNonCGI, baseOnly = TRUE, collapse = TRUE)[1:4]
freqNon / sum(freqNon)
##         A         C         G         T 
## 0.3008292 0.1993832 0.1993737 0.3004139

This shows an inverse pattern: in the CpG islands, C and G have frequencies around 0.32, whereas in the non-CpG islands, we have A and T that have frequencies around 0.30.

But how can we decide if a new sequence comes from a CpG island? What is the probability that it belongs to a CpG island rather than elsewhere? We compute a score based on what is called the odds ratio, i.e. \(P(\text{sequence} \mid \text{island}) / P(\text{sequence} \mid \text{non-island})\). See the textbook for details as to how this is evaluated.

alpha = log((freqIsl/sum(freqIsl)) / (freqNon/sum(freqNon)))
beta  = log(MI / MN)

scorefun = function(x) {
  s = unlist(strsplit(x, ""))
  score = alpha[s[1]]
  if (length(s) >= 2)
    for (j in 2:length(s))
      score = score + beta[s[j-1], s[j]]
  score
}

x = "ACGTTATACTACG"
scorefun(x)
##          A 
## -0.2824623

In the code below, we pick subsequences of length len = 100 out of the 2855 sequences in the seqCGI object, and then out of the 2854 sequences in the seqNonCGI object (each of them is a DNAStringSet). In the first three lines of the generateRandomScores function, we drop sequences that contain any letters other than A, C, T, G, such as “.” (a character used for undefined nucleotides). Among the remaining sequences, we sample with probabilities proportional to their length minus len, and then pick subsequences of length len out of them. The start points of the subsequences are sampled uniformly, with the constraint that the subsequences have to fit in.

generateRandomScores = function(s, len = 100, B = 1000) {
  alphFreq = alphabetFrequency(s)
  isGoodSeq = rowSums(alphFreq[, 5:ncol(alphFreq)]) == 0
  s = s[isGoodSeq]
  slen = sapply(s, length)
  prob = pmax(slen - len, 0)
  prob = prob / sum(prob)
  idx  = sample(length(s), B, replace = TRUE, prob = prob)
  ssmp = s[idx]
  start = sapply(ssmp, function(x) sample(length(x) - len, 1))
  scores = sapply(seq_len(B), function(i)
    scorefun(as.character(ssmp[[i]][start[i]+(1:len)]))
  )
  scores / len
}
scoresCGI    = generateRandomScores(seqCGI)
scoresNonCGI = generateRandomScores(seqNonCGI)

Now we can use ggplot() to view the distributions of the computed scores for the two types of regions:

df <- rbind(
  data.frame(region = "cgi", score = scoresCGI),
  data.frame(region = "ncgi", score = scoresNonCGI)
)

ggplot(df, aes(x = score, fill = region)) +
  geom_histogram(alpha = 0.5, color = "black", position = "identity")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
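As an alternative view of the same scores, density curves avoid having to choose a bin width. This sketch assumes the df data frame built above and that ggplot2 is loaded:

```r
# smoothed density estimate of the scores, one curve per region
ggplot(df, aes(x = score, fill = region)) +
  geom_density(alpha = 0.5)
```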

In-class exercise (Thursday)

We will walk through the following exercise in class. Feel free to have a look beforehand, but we will solve it in groups during the Thursday session.

Exercise 1: Customized scatter plot

You will try to recreate a plot from an Economist article showing the relationship between well-being and financial inclusion.

You can find the accompanying article at this link

The data for the exercises EconomistData.csv can be downloaded from the class github repository.

library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble  3.0.3     ✓ purrr   0.3.4
## ✓ tidyr   1.1.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x Biostrings::collapse()   masks IRanges::collapse(), dplyr::collapse()
## x Biobase::combine()       masks BiocGenerics::combine(), dplyr::combine()
## x purrr::compact()         masks XVector::compact()
## x matrixStats::count()     masks dplyr::count()
## x IRanges::desc()          masks dplyr::desc()
## x tidyr::expand()          masks S4Vectors::expand()
## x ensembldb::filter()      masks dplyr::filter(), stats::filter()
## x S4Vectors::first()       masks dplyr::first()
## x dplyr::lag()             masks stats::lag()
## x BiocGenerics::Position() masks ggplot2::Position(), base::Position()
## x purrr::reduce()          masks GenomicRanges::reduce(), IRanges::reduce()
## x S4Vectors::rename()      masks dplyr::rename()
## x ensembldb::select()      masks AnnotationDbi::select(), dplyr::select()
## x purrr::simplify()        masks DelayedArray::simplify()
## x XVector::slice()         masks IRanges::slice(), dplyr::slice()
url <- paste0("https://raw.githubusercontent.com/biox-rbootcamp/biox-rbootcamp.github.io/master/assets/data/EconomistData.csv")
dat <- read_csv(url)
## Parsed with column specification:
## cols(
##   Country = col_character(),
##   SEDA.Current.level = col_double(),
##   SEDA.Recent.progress = col_double(),
##   Wealth.to.well.being.coefficient = col_double(),
##   Growth.to.well.being.coefficient = col_double(),
##   Percent.of.15plus.with.bank.account = col_double(),
##   EPI_regions = col_character(),
##   Region = col_character()
## )
head(dat)
## # A tibble: 6 x 8
##   Country SEDA.Current.le… SEDA.Recent.pro… Wealth.to.well.… Growth.to.well.…
##   <chr>              <dbl>            <dbl>            <dbl>            <dbl>
## 1 Albania             50               63.3             1.27             1.31
## 2 Algeria             40.6             46.5             0.87             1.03
## 3 Angola              17.8             76.2             0.54             1.21
## 4 Argent…             54.1             49.1             0.91             0.89
## 5 Armenia             43.8             46               1.25             1.11
## 6 Austra…             87.9             40.9             1.07             0.92
## # … with 3 more variables: Percent.of.15plus.with.bank.account <dbl>,
## #   EPI_regions <chr>, Region <chr>
  1. Create a scatter plot similar to the one in the article, where the x axis corresponds to percent of people over the age of 15 with a bank account (the Percent.of.15plus.with.bank.account column) and the y axis corresponds to the current SEDA score SEDA.Current.level.
  2. Color all points blue.
  3. Color points according to the Region variable.
  4. Overlay a fitted smoothing trend on top of the scatter plot. Try to change the span argument in geom_smooth to a low value and see what happens.
  5. Overlay a regression line on top of the scatter plot. Hint: use geom_smooth with an appropriate method argument.
  6. Facet the previous plot by Region.
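As a starting point only (we will build the full plot together in class), the first two steps might look like the sketch below; the column names come from the head(dat) output above:

```r
# steps 1 & 2: scatter plot with all points colored blue
ggplot(dat, aes(x = Percent.of.15plus.with.bank.account,
                y = SEDA.Current.level)) +
  geom_point(color = "blue")
```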

Part 2: Distribution of categorical variables

  1. Generate a bar plot showing the number of countries included in the dataset from each Region.
  2. Rotate the plot so that the bars are horizontal.
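A minimal sketch covering both steps: geom_bar() counts the rows in each category by default, and coord_flip() turns the vertical bars into horizontal ones:

```r
ggplot(dat, aes(x = Region)) +
  geom_bar() +     # one bar per region, height = number of countries
  coord_flip()     # horizontal bars
```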

Part 3: Distribution of continuous variables

  1. Create boxplots of SEDA scores, SEDA.Current.level separately for each Region.
  2. Overlay points on top of the box plots.
  3. The points you added are plotted on top of each other; to distinguish them, jitter each point a little in the horizontal direction.
  4. Now substitute your boxplot with a violin plot.
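One possible sketch of the first two steps; for steps 3 and 4, geom_jitter() and geom_violin() are the natural replacements for geom_point() and geom_boxplot():

```r
ggplot(dat, aes(x = Region, y = SEDA.Current.level)) +
  geom_boxplot() +
  geom_point()
```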

In-class Exercise 2 (time allowing)

The following url: "http://cdiac.ess-dive.lbl.gov/ftp/ndp030/CSV-FILES/nation.1751_2014.csv" contains data on fossil fuel emissions. Read the data into R. Note that rows 1-3 contain information on the dataset itself; we delete these rows as they do not contain relevant information.

library(readr)
url <- "http://cdiac.ess-dive.lbl.gov/ftp/ndp030/CSV-FILES/nation.1751_2014.csv"
emissions <- read_csv(url)
## Parsed with column specification:
## cols(
##   Nation = col_character(),
##   Year = col_double(),
##   `Total CO2 emissions from fossil-fuels and cement production (thousand metric tons of C)` = col_double(),
##   `Emissions from solid fuel consumption` = col_double(),
##   `Emissions from liquid fuel consumption` = col_double(),
##   `Emissions from gas fuel consumption` = col_double(),
##   `Emissions from cement production` = col_double(),
##   `Emissions from gas flaring` = col_character(),
##   `Per capita CO2 emissions (metric tons of carbon)` = col_character(),
##   `Emissions from bunker fuels (not included in the totals)` = col_double()
## )
## Warning: 615 parsing failures.
##  row                                    col   expected    actual                                                                      file
##    1 NA                                     10 columns 1 columns 'http://cdiac.ess-dive.lbl.gov/ftp/ndp030/CSV-FILES/nation.1751_2014.csv'
##    2 NA                                     10 columns 1 columns 'http://cdiac.ess-dive.lbl.gov/ftp/ndp030/CSV-FILES/nation.1751_2014.csv'
##    3 NA                                     10 columns 1 columns 'http://cdiac.ess-dive.lbl.gov/ftp/ndp030/CSV-FILES/nation.1751_2014.csv'
## 3471 Emissions from solid fuel consumption  a double   .         'http://cdiac.ess-dive.lbl.gov/ftp/ndp030/CSV-FILES/nation.1751_2014.csv'
## 3471 Emissions from liquid fuel consumption a double   .         'http://cdiac.ess-dive.lbl.gov/ftp/ndp030/CSV-FILES/nation.1751_2014.csv'
## .... ...................................... .......... ......... .........................................................................
## See problems(...) for more details.
emissions <- emissions[-(1:3), ]
emissions[emissions == "."] <- NA
emissions <- type_convert(emissions)
## Parsed with column specification:
## cols(
##   Nation = col_character(),
##   `Emissions from gas flaring` = col_double(),
##   `Per capita CO2 emissions (metric tons of carbon)` = col_double()
## )
emissions
## # A tibble: 17,232 x 10
##    Nation  Year `Total CO2 emis… `Emissions from… `Emissions from…
##    <chr>  <dbl>            <dbl>            <dbl>            <dbl>
##  1 AFGHA…  1949                4                4                0
##  2 AFGHA…  1950               23                6               18
##  3 AFGHA…  1951               25                7               18
##  4 AFGHA…  1952               25                9               17
##  5 AFGHA…  1953               29               10               18
##  6 AFGHA…  1954               29               12               18
##  7 AFGHA…  1955               42               17               25
##  8 AFGHA…  1956               50               17               33
##  9 AFGHA…  1957               80               21               59
## 10 AFGHA…  1958               90               25               65
## # … with 17,222 more rows, and 5 more variables: `Emissions from gas fuel
## #   consumption` <dbl>, `Emissions from cement production` <dbl>, `Emissions
## #   from gas flaring` <dbl>, `Per capita CO2 emissions (metric tons of
## #   carbon)` <dbl>, `Emissions from bunker fuels (not included in the
## #   totals)` <dbl>
  1. Use dplyr functions to compute the total yearly \(CO_2\) emissions (column Total.CO2.emissions.from.fossil.fuels.and.cement.production..thousand.metric.tons.of.C.) summed over all countries (the world total \(CO_2\) emission). Use the dataset to plot the world's yearly \(CO_2\) emission in Gt.

  2. Find the top 10 countries with the highest emissions after year 2000 (including 2000). Plot the yearly total \(CO_2\) emissions of these top 10 countries with a different color for each country. Use billion tonnes (Gt) units, i.e. divide the total emissions by \(10^6\).

  3. Use geom_area() to generate a plot similar to the one you generated above but, with emission levels stacked on top of each other (summing to the total for the ten countries) with areas colored by countries.
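For step 1, one possible dplyr pipeline is sketched below (the backticked column name is the one reported by read_csv above; dividing thousand metric tons by \(10^6\) gives Gt):

```r
library(dplyr)
library(ggplot2)

# world total per year, converted from thousand metric tons to Gt
world <- emissions %>%
  group_by(Year) %>%
  summarise(total_Gt = sum(
    `Total CO2 emissions from fossil-fuels and cement production (thousand metric tons of C)`,
    na.rm = TRUE) / 1e6)

ggplot(world, aes(x = Year, y = total_Gt)) +
  geom_line() +
  labs(y = "World CO2 emissions (Gt C)")
```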